# Multimodal Understanding
Nvidia.cosmos Reason1 7B GGUF
Cosmos-Reason1-7B is a 7B-parameter foundational model released by NVIDIA, specializing in image-to-text tasks.
Large Language Model
DevQuasar · 287 downloads · 1 like
Devstral Small Vision 2505 GGUF
Apache-2.0
A vision encoder based on the Mistral Small model, supporting image-text generation tasks and compatible with the llama.cpp framework.
Image-to-Text
ngxson · 777 downloads · 20 likes
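Since this build targets llama.cpp, it can also be driven from Python through the llama-cpp-python bindings. The sketch below is illustrative only: the repo id and GGUF filename pattern are assumptions, and image input additionally requires the model's mmproj file plus a matching multimodal chat handler.

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python huggingface_hub).
# Repo id and filename pattern are assumptions; check the model card for actual names.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="ngxson/Devstral-Small-Vision-2505-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",                           # assumed quant filename pattern
    n_ctx=8192,
)

# Text-only chat shown here; image input would additionally need the mmproj GGUF
# and a multimodal chat handler matching this model family.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a vision encoder does."}]
)
print(out["choices"][0]["message"]["content"])
```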
Magma 8B GGUF
MIT
Magma-8B is an image-text-to-text model distributed in GGUF format, suitable for multimodal task processing.
Image-to-Text
Mungert · 545 downloads · 1 like
Typhoon Ocr 7b
A vision-language model designed specifically for Thai-English real-world document parsing, built on Qwen2.5-VL-Instruct.
Image-to-Text
Transformers · Supports Multiple Languages
scb10x · 126 downloads · 9 likes
Qwen Qwen2.5 VL 72B Instruct GGUF
Other
A quantized version of the Qwen2.5-VL-72B-Instruct multimodal large language model, supporting image-text-to-text tasks and available in quantization levels ranging from high precision to low memory footprint.
Image-to-Text · English
bartowski · 1,336 downloads · 1 like
Qwen Qwen2.5 VL 7B Instruct GGUF
Apache-2.0
A quantized version of Qwen2.5-VL-7B-Instruct produced with llama.cpp, supporting multimodal tasks such as image-to-text conversion.
Image-to-Text · English
bartowski · 2,056 downloads · 2 likes
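The GGUF files here are consumed by llama.cpp; for comparison, the underlying Qwen2.5-VL-7B-Instruct checkpoint can be run with Hugging Face Transformers roughly as sketched below (class names assume a recent Transformers release with Qwen2.5-VL support; the image path is a placeholder).

```python
# Sketch: running the unquantized Qwen2.5-VL-7B-Instruct with Transformers
# (the GGUF variants above are instead consumed by llama.cpp).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```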
Vilt Finetuned 100
Apache-2.0
A vision-language model based on ViLT-B32-MLM, fine-tuned on VQA datasets.
Image-to-Text
Transformers
bangbrecho · 15 downloads · 0 likes
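As a ViLT-based VQA fine-tune, it presumably follows the standard ViLT question-answering interface in Transformers; a minimal sketch, assuming the checkpoint stores its VQA answer labels in the config (repo id and image path are placeholders):

```python
# Sketch: visual question answering with a ViLT checkpoint via Transformers.
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image

model_id = "bangbrecho/vilt_finetuned_100"  # assumed repo id
processor = ViltProcessor.from_pretrained(model_id)
model = ViltForQuestionAnswering.from_pretrained(model_id)

image = Image.open("example.jpg")           # placeholder image
question = "How many people are in the picture?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
# Assumes the fine-tune keeps an id2label mapping for its VQA answer vocabulary.
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)
```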
TEMPURA Qwen2.5 VL 3B S1
TEMPURA is a video temporal understanding framework combining causal reasoning with fine-grained temporal segmentation, enhancing video event comprehension through two-stage training
Video-to-Text
Transformers
andaba · 16 downloads · 0 likes
Qwen2.5 Vl 7b Cam Motion Preview
Other
A camera motion analysis model fine-tuned from Qwen2.5-VL-7B-Instruct, focused on camera motion classification in videos and on video-text retrieval tasks.
Video-to-Text
Transformers
chancharikm · 1,456 downloads · 10 likes
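Because this is described as a Qwen2.5-VL-7B-Instruct fine-tune, video inference presumably mirrors the base model's Transformers interface, with a clip passed as a list of sampled frames; a rough sketch under that assumption (frame sampling, prompt format, and video-related kwargs may differ in practice):

```python
# Sketch: video-to-text with a Qwen2.5-VL-based checkpoint, assuming it keeps
# the base model's Transformers interface. Frame extraction is simplified.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

model_id = "chancharikm/qwen2.5-vl-7b-cam-motion-preview"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A "video" here is a list of pre-extracted frames (placeholder paths).
frames = [Image.open(f"frame_{i:03d}.jpg") for i in range(8)]

messages = [{
    "role": "user",
    "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe the camera motion in this clip."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], videos=[frames], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```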
Gemma 3 12b It Qat Int4 GGUF
Gemma 3 is Google's lightweight open model series based on Gemini technology. The 12B version employs Quantization-Aware Training (QAT) technology, supports multimodal input, and features a 128K context window.
Image-to-Text
unsloth · 1,921 downloads · 3 likes
Gemma 3 27b It Qat GGUF
Gemma 3 is a lightweight open model series built by Google based on Gemini technology, supporting multimodal input and text output, featuring a 128K large context window and support for 140+ languages.
Image-to-Text · English
unsloth · 2,683 downloads · 3 likes
Gemma 3 12b It Qat Int4
Gemma 3 is a lightweight open model series from Google, built on the research and technology used to create Gemini models. The 12B version is an instruction-tuned multimodal model supporting text and image inputs to generate text outputs.
Image-to-Text
Transformers
unsloth · 78 downloads · 1 like
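Transformers-compatible Gemma 3 instruction-tuned checkpoints generally follow the same image-text-to-text flow; a minimal sketch shown against the base google/gemma-3-12b-it id, on the assumption that this QAT Int4 variant exposes the same interface (the image path is a placeholder):

```python
# Sketch: image+text -> text with a Gemma 3 instruction-tuned checkpoint.
# Shown with the base model id; the QAT Int4 repo is assumed to behave the same.
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image

model_id = "google/gemma-3-12b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("example.jpg")},  # placeholder image
        {"type": "text", "text": "What is shown in this picture?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```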
Blip Gqa Ft
MIT
A fine-tuned vision-language model based on Salesforce/blip2-opt-2.7b for visual question answering tasks
Image-to-Text
Transformers
phucd · 29 downloads · 0 likes
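Given the stated Salesforce/blip2-opt-2.7b base, visual question answering presumably uses the standard BLIP-2 classes in Transformers; a minimal sketch (repo id, prompt, and image path are placeholders):

```python
# Sketch: visual question answering with a BLIP-2 (OPT-2.7B) based checkpoint.
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

model_id = "phucd/blip_gqa_ft"  # assumed repo id; base is Salesforce/blip2-opt-2.7b
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
prompt = "Question: what color is the car? Answer:"  # BLIP-2 OPT question format

inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```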
Blip Custom Captioning
BSD-3-Clause
BLIP is a unified vision-language pretraining framework, excelling in vision-language tasks such as image caption generation
Image-to-Text
hiteshsatwani · 78 downloads · 0 likes
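For caption generation, the usual BLIP interface in Transformers looks roughly like this; a sketch that assumes the checkpoint keeps the standard BLIP captioning head (repo id and image path are placeholders):

```python
# Sketch: image captioning with a BLIP checkpoint via Transformers.
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

model_id = "hiteshsatwani/blip-custom-captioning"  # assumed repo id
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
inputs = processor(images=image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```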
Internvl3 8B 6bit
Other
InternVL3-8B-6bit is a vision-language model converted to MLX format, supporting multilingual image-text-to-text tasks.
Image-to-Text
Transformers · Other
mlx-community · 70 downloads · 1 like
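MLX conversions like this one are typically run on Apple silicon with the mlx-vlm package; a rough sketch following that package's documented load/generate helpers (function signatures vary between mlx-vlm versions, and the image path is a placeholder):

```python
# Sketch: running an MLX-converted vision-language model with mlx-vlm
# (Apple silicon only; pip install mlx-vlm). Helper signatures may vary by version.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/InternVL3-8B-6bit"  # assumed repo id
model, processor = load(model_path)
config = load_config(model_path)

image = ["example.jpg"]  # placeholder image path
prompt = "Describe this image."

formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```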
Gemma 3 12B It Qat GGUF
Gemma 3 12B IT is a large language model developed by Google, supporting multimodal input and long-context processing.
Image-to-Text
lmstudio-community · 36.65k downloads · 4 likes
Gemma 3 4B It Qat GGUF
The Gemma 3 4B IT model by Google supports multimodal input and long-context processing, suitable for text generation and image understanding tasks.
Image-to-Text
lmstudio-community · 46.55k downloads · 10 likes
Gemma 3 27b It Qat 3bit
Other
This model is a 3-bit quantized version converted from google/gemma-3-27b-it-qat-q4_0-unquantized to the MLX format, suitable for image-to-text tasks.
Image-to-Text
Transformers · Other
mlx-community · 197 downloads · 2 likes
Gemma 3 27b It Qat 4bit
Other
Gemma 3 27B IT QAT 4bit is an MLX-format model converted from Google's original model, supporting image-to-text tasks.
Image-to-Text
Transformers · Other
mlx-community · 2,200 downloads · 12 likes
Gemma 3 4b It GPTQ 4b 128g
An INT4 (GPTQ, group size 128) quantized version of the gemma-3-4b-it model, significantly reducing storage and computational resource requirements.
Image-to-Text
Transformers
ISTA-DASLab · 502 downloads · 2 likes
Gemma 3 1b It Qat Q4 0 Unquantized
Gemma 3 is a lightweight open-source multimodal model series developed by Google, built on Gemini technology, supporting text and image inputs with text outputs. The 1B version has undergone instruction tuning and quantization-aware training (QAT), making it suitable for deployment in resource-constrained environments.
Image-to-Text
Transformers
google · 246 downloads · 4 likes
Gemma 3 12b It Qat Q4 0 Unquantized
Gemma 3 is Google's lightweight open-source multimodal model series based on Gemini technology, supporting text and image inputs with text outputs. The 12B version undergoes instruction tuning and quantization-aware training (QAT), making it suitable for deployment in resource-limited environments.
Image-to-Text
Transformers
google · 1,159 downloads · 10 likes
Llama 4 Scout 17B 16E Linearized Bnb Nf4 Bf16
Other
Llama 4 Scout is a Mixture of Experts (MoE) model released by Meta with 17B active parameters and 16 experts, supporting multilingual text and image understanding; this build linearizes the expert modules for PEFT/LoRA compatibility.
Multimodal Fusion
Transformers · Supports Multiple Languages
axolotl-quants · 6,861 downloads · 3 likes
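The linearized expert layout is meant to make the MoE blocks visible to PEFT, so LoRA fine-tuning presumably follows the usual peft recipe on top of the NF4 weights; a generic sketch in which the repo id, auto class, and target module names are illustrative assumptions rather than the repository's documented recipe:

```python
# Sketch: attaching a LoRA adapter to an NF4-quantized MoE checkpoint with peft.
# The repo id, auto class, and target module names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "axolotl-quants/Llama-4-Scout-17B-16E-Linearized-bnb-nf4-bf16"  # assumed id

# The checkpoint already ships bitsandbytes NF4 weights, so the stored
# quantization config should be picked up automatically on load.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = prepare_model_for_kbit_training(model)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```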
Gemma 3 4b It Qat Q4 0 GGUF
Gemma is a family of lightweight, cutting-edge open models introduced by Google, built on the same research and technology as the Gemini models. Supports text and image inputs and generates text outputs.
Image-to-Text
Mungert · 713 downloads · 2 likes
Gemma 3 27b It Qat Autoawq
Gemma 3 is a lightweight, cutting-edge open model series from Google, built on the same technology as Gemini, supporting multimodal input (text/image) and text output. The 27B version significantly reduces memory requirements through quantization-aware training.
Image-to-Text
Safetensors
gaunernst · 789 downloads · 4 likes
Gemma 3 12b It Qat Autoawq
Gemma 3 is Google's lightweight open model series based on Gemini technology, supporting multimodal input and text output.
Image-to-Text
Safetensors
gaunernst · 498 downloads · 3 likes
Gemma 3 27b It Qat Q4 0 Gguf
Gemma 3 is a lightweight open-source multimodal model series by Google, supporting text and image inputs with text generation capabilities. This version is a 27B parameter instruction-tuned model using quantization-aware training, offering lower memory requirements while maintaining near-original quality.
Image-to-Text
vinimuchulski · 4,674 downloads · 6 likes
Gemma 3 12b It Qat Q4 0 Gguf
Gemma 3 is a lightweight open model built by Google based on Gemini technology, supporting text and image inputs to generate text outputs. The 12B version is instruction-tuned and suitable for various generation and comprehension tasks.
Image-to-Text
vinimuchulski · 1,860 downloads · 4 likes
Llama 4 Maverick 17B 128E Instruct
Other
Llama 4 Maverick is a multimodal Mixture of Experts (MoE) model developed by Meta with 17B active parameters and 128 expert modules, supporting multilingual text and image understanding.
Large Language Model
Transformers · Supports Multiple Languages
meta-llama · 87.79k downloads · 309 likes
Qwen2.5 VL 7B Instruct Q8 0 GGUF
Apache-2.0
This model is a GGUF-format conversion of Qwen2.5-VL-7B-Instruct, supporting multimodal tasks and suited to interactive image-and-text processing.
Image-to-Text · English
cxtb · 72 downloads · 1 like
Gemma 3 27b It Int4 Gguf
Gemma 3 is a lightweight, cutting-edge open model family from Google, built on the same research and technology as the Gemini models. It supports text/image input and text output, and is offered in both pretrained and instruction-tuned weight versions.
Image-to-Text
gaunernst · 232 downloads · 3 likes
Gemma 3 12b It Int4 Gguf
Gemma 3 is a lightweight multimodal open model from Google that supports text and image inputs with text outputs, featuring a 128K large context window and support for 140+ languages.
Image-to-Text
gaunernst · 107 downloads · 1 like
Sapnous VR 6B
Apache-2.0
Sapnous-6B is an advanced vision-language model that enhances perception and understanding of the world through powerful multimodal capabilities.
Image-to-Text
Transformers · English
Sapnous-AI · 261 downloads · 5 likes
Openvlthinker 7B
Apache-2.0
OpenVLThinker-7B is a vision-language reasoning model specifically designed for multimodal tasks, with particular optimization for solving visual mathematical problems.
Image-to-Text
Transformers
ydeng9 · 594 downloads · 16 likes
Gemma 3 12b It Int4 Awq
Gemma is Google's lightweight cutting-edge open-source model family, built using the same research technology as Gemini models. Gemma 3 is a multimodal model supporting text/image input and text output.
Image-to-Text
Transformers
gaunernst · 4,658 downloads · 9 likes
Timezero ActivityNet 7B
TimeZero is a reasoning-guided large-scale vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks, achieving dynamic video-language relationship analysis through reinforcement learning methods.
Video-to-Text
Transformers
wwwyyy · 142 downloads · 1 like
Timezero Charades 7B
TimeZero is a reasoning-guided large vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks. It identifies temporal segments in videos corresponding to natural language queries through reinforcement learning methods.
Video-to-Text
Transformers
wwwyyy · 183 downloads · 0 likes
Gemma 3 12b Pt Unsloth Bnb 4bit
Gemma 3 is a lightweight, advanced open model series launched by Google, built on the same research technology as Gemini, supporting multimodal input and text output.
Image-to-Text
Transformers · English
unsloth · 1,286 downloads · 1 like
Gemma 3 4b Pt Qat Q4 0 Gguf
Gemma 3 is a lightweight open model series launched by Google, built on the same technology as Gemini, supporting multimodal input and text output.
Image-to-Text
google · 912 downloads · 16 likes
Gemma 3 27b It Mlx
This is an MLX-converted version of the Google Gemma 3 27B IT model, supporting image-text-to-text tasks.
Image-to-Text
Transformers
stephenwalker · 24 downloads · 1 like